59 research outputs found

    Ensemble Data Mining Methods

    Get PDF
    Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods that leverage the power of multiple models to achieve better prediction accuracy than any of the individual models could on their own. The basic goal when designing an ensemble is the same as when establishing a committee of people: each member of the committee should be as competent as possible, but the members should be complementary to one another. If the members are not complementary, Le., if they always agree, then the committee is unnecessary---any one member is sufficient. If the members are complementary, then when one or a few members make an error, the probability is high that the remaining members can correct this error. Research in ensemble methods has largely revolved around designing ensembles consisting of competent yet complementary models

    Classification

    Get PDF
    A supervised learning task involves constructing a mapping from input data (normally described by several features) to the appropriate outputs. Within supervised learning, one type of task is a classification learning task, in which each output is one or more classes to which the input belongs. In supervised learning, a set of training examples---examples with known output values---is used by a learning algorithm to generate a model. This model is intended to approximate the mapping between the inputs and outputs. This model can be used to generate predicted outputs for inputs that have not been seen before. For example, we may have data consisting of observations of sunspots. In a classification learning task, our goal may be to learn to classify sunspots into one of several types. Each example may correspond to one candidate sunspot with various measurements or just an image. A learning algorithm would use the supplied examples to generate a model that approximates the mapping between each supplied set of measurements and the type of sunspot. This model can then be used to classify previously unseen sunspots based on the candidate's measurements. This chapter discusses methods to perform machine learning, with examples involving astronomy

    Ask-The-Expert: Minimizing Human Review for Big Data Analytics Through Active Learning

    Get PDF
    In this CIF project, we worked toward semi-automating knowledge discovery from anomaly detection algorithms through the use of active learning. Active learning is an area of research within machine learning that uses an "expert in the loop" to learn from large data sets that have very few annotations or labels available, and where providing such labels is expensive. In our case, the task can be defined as the identification of safety events from flight operational data. Since traditional anomaly detection algorithms cannot differentiate between operationally relevant and irrelevant statistical anomalies, Subject Matter Experts (SMEs) have a lengthy and expensive burden of investigating every example identified by the detection algorithm, classifying and labeling them as relevant or irrelevant. Active learningidentifies the unlabeled example for which a label would most improve the classifier, asks the domain expert for a label, and repeats this process until there are no more resources (time, budget) available for labeling or a minimum required performance is reached. A positive label indicates an operationally significant safety event whereas a negative label indicates otherwise. Based on these few labels we propose to build an active learning system that utilizes the SME's time in the most effective manner by iteratively asking for labels for as few informative instances as possible. Our work was proposed to be a stepping stone toward implementation and deployment of the system with user interface to be pursued by the Aviation Operations and Safety Program (AOSP) given its interest in safety monitoring and discovery of safety incidents

    Multiple Kernel Learning for Heterogeneous Anomaly Detection: Algorithm and Aviation Safety Case Study

    Get PDF
    The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded in one second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. In this paper, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large data bases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequence of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and also discuss results on real-world data sets. Our algorithm uncovers operationally significant events in high dimensional data streams in the aviation industry which are not detectable using state of the art method

    nu-Anomica: A Fast Support Vector Based Novelty Detection Technique

    Get PDF
    In this paper we propose nu-Anomica, a novel anomaly detection technique that can be trained on huge data sets with much reduced running time compared to the benchmark one-class Support Vector Machines algorithm. In -Anomica, the idea is to train the machine such that it can provide a close approximation to the exact decision plane using fewer training points and without losing much of the generalization performance of the classical approach. We have tested the proposed algorithm on a variety of continuous data sets under different conditions. We show that under all test conditions the developed procedure closely preserves the accuracy of standard one-class Support Vector Machines while reducing both the training time and the test time by 5 - 20 times

    Fast and Flexible Multivariate Time Series Subsequence Search

    Get PDF
    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns from these MTS databases which often contain several gigabytes of data. Surprisingly, research on MTS search is very limited. Most of the existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases, that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two algorithms to solve this problem (1) a List Based Search (LBS) algorithm which uses sorted lists for indexing, and (2) a R*-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences. Both algorithms guarantee that all matching patterns within the specified thresholds will be returned (no false dismissals). The very few false alarms can be removed by a post-processing step. Since our framework is also capable of Univariate Time-Series (UTS) subsequence search, we first demonstrate the efficiency of our algorithms on several UTS datasets previously used in the literature. We follow this up with experiments using two large MTS databases from the aviation domain, each containing several millions of observations. Both these tests show that our algorithms have very high prune rates (>99%) thus needing actual disk access for only less than 1% of the observations. To the best of our knowledge, MTS subsequence search has never been attempted on datasets of the size we have used in this paper

    Ask-the-expert: Active Learning Based Knowledge Discovery Using the Expert

    Get PDF
    Often the manual review of large data sets, either for purposes of labeling unlabeled instances or for classifying meaningful results from uninteresting (but statistically significant) ones is extremely resource intensive, especially in terms of subject matter expert (SME) time. Use of active learning has been shown to diminish this review time significantly. However, since active learning is an iterative process of learning a classifier based on a small number of SME-provided labels at each iteration, the lack of an enabling tool can hinder the process of adoption of these technologies in real-life, in spite of their labor-saving potential. In this demo we present ASK-the-Expert, an interactive tool that allows SMEs to review instances from a data set and provide labels within a single framework. ASK-the-Expert is powered by an active learning algorithm for training a classifier in the backend. We demonstrate this system in the context of an aviation safety application, but the tool can be adopted to work as a simple review and labeling tool as well, without the use of active learning

    Using ADOPT Algorithm and Operational Data to Discover Precursors to Aviation Adverse Events

    Get PDF
    The US National Airspace System (NAS) is making its transition to the NextGen system and assuring safety is one of the top priorities in NextGen. At present, safety is managed reactively (correct after occurrence of an unsafe event). While this strategy works for current operations, it may soon become ineffective for future airspace designs and high density operations. There is a need for proactive management of safety risks by identifying hidden and "unknown" risks and evaluating the impacts on future operations. To this end, NASA Ames has developed data mining algorithms that finds anomalies and precursors (high-risk states) to safety issues in the NAS. In this paper, we describe a recently developed algorithm called ADOPT that analyzes large volumes of data and automatically identifies precursors from real world data. Precursors help in detecting safety risks early so that the operator can mitigate the risk in time. In addition, precursors also help identify causal factors and help predict the safety incident. The ADOPT algorithm scales well to large data sets and to multidimensional time series, reduce analyst time significantly, quantify multiple safety risks giving a holistic view of safety among other benefits. This paper details the algorithm and includes several case studies to demonstrate its application to discover the "known" and "unknown" safety precursors in aviation operation

    Fast Multivariate Search on Large Aviation Datasets

    Get PDF
    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns from these MTS databases which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases, that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem (1) an R-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several millions of observations Both these tests show that our algorithms have very high prune rates (>95%) thus needing actua
    corecore